Abstract

Many customer services are already available on
Social Network Sites (SNSs), including user
recommendation and media interaction, to name a few.
There is a strong desire to provide online users with
more dedicated and personalized services that fit each
individual's needs, which usually depend strongly on
the user's inner personality. However, little has been
done to conduct the proper psychological analysis that
is crucial for explaining a user's outer behaviors in
terms of inner personality. In this paper, we propose
an approach intended to facilitate this line of
research by directly predicting the so-called Big Five
personality from a user's SNS behaviors. In contrast
to conventional inventory-based psychological
analysis, we demonstrate via experimental studies that
users' personalities can be predicted with reasonable
precision from their online behaviors. Besides
confirming some earlier behavior-personality
correlation results, our experiments show that
extraversion is positively related to the proportion
of republished statuses and that neuroticism is
positively related to the proportion of one's angry
blogs (blogs that make readers angry).

1 INTRODUCTION 

Personality is, in its psychological definition, the
particular combination of emotional, attitudinal, and
behavioral response patterns of an individual
(Wikipedia, 2012). According to the classic Big Five
personality traits theory, personality can in most
cases be divided into five dimensions: agreeableness,
conscientiousness, extraversion, neuroticism, and
openness (Lounsbury, 2006). Agreeableness refers to
being helpful, cooperative, and sympathetic towards
others. Conscientiousness is determined by being
disciplined, organized, and achievement-oriented.
Extraversion is displayed through a higher degree of
sociability, assertiveness, and talkativeness.
Neuroticism refers to the degree of emotional
stability, impulse control, and anxiety. Finally,
openness is reflected in a strong intellectual
curiosity and a preference for novelty and variety
(Funder, 2001).

Analyzing individual personality is important for many
research areas. The Canada Peer Counseling Center
(Chen, 1998) reports that, for most of the people it
investigated, recommendations from companion
volunteers with the same world view were the most
effective. This suggests that people with similar
personalities tend to attract each other. As a result,
analysis of personality features can be the basis for
building personalized services. For example, an
extraverted user may have a higher level of online
activity and be more likely to use a recommendation
system to make new friends with strangers (McElroy,
2012).

Analyzing outer behavior is the basis of inner
personality analysis, since behavior is the
manifestation of personality. In psychological
research, most traditional personality experiments are
based on self-reported inventories. However, this
approach has its own bottlenecks. When participants
provide self-report data, they may reflect their
self-views rather than their actual behavior. Other
data collection methods, such as observable profile
information, require a lot of manual effort and are
not suitable for large-scale data collection. At the
same time, most personality research can only
establish a covariation relation between behavior and
personality rather than a quantitative personality
prediction.

To address these disadvantages, we propose an
automatic and objective personality prediction system
based on users' behaviors on Social Network Sites
(SNSs). Online SNSs like Facebook and RenRen (Renren,
2012) have developed quickly during the past decade.
They have become a part of people's lives and an
extension of the real world. According to the Chinese
social e-commerce report from IResearch, Chinese SNSs
had a total of 370 million registered users in 2011,
an increase of 17.6% over the previous year. The user
count is predicted to reach 510 million in 2014
(IResearch, 2011). RenRen, a Chinese counterpart of
Facebook, has the highest market share among Chinese
SNSs (Baidu, 2009). RenRen provides a wide range of
functions for information exchange, such as blogging,
status updates, and photo/video sharing, through which
people can keep in touch with each other (Boyd, 2007).

Online SNS behaviors and real-world behaviors have a
lot in common (Lounsbury, 2006). Both self-report and
interactive behaviors are supported on SNSs.
Therefore, many researchers work in this field.
Computer science techniques such as Information
Retrieval (IR) and recommendation systems help solve
many problems using keyword-resource matching and
collaborative filtering. However, as social network
functions improve and user demand increases, user
experience must be given high priority. More
personalized systems that connect personal
preferences, inferred from network behavior, with
online resources are welcome.

Since computer science needs psychologically
personalized services and psychology needs automatic
computation, we build a relation between personality
and online behavior on Renren and use automatic
computation to predict a user's personality
attributes. With our model, a user's Big Five
personality can be predicted from her SNS behavior.
Section 2 reviews related work in both computer
science and psychology. Section 3 explains our
research methods. Section 4 presents our experimental
results. Finally, Section 5 concludes and discusses
future work.

2 RELATED WORK 

Previous research on SNSs mostly focuses on
topological characteristics (Kwak, 2007), web
community mining (Kevin, 2010), and so on. These
results show that the virtual world is a facsimile of
real society and follows most sociological principles,
such as Six Degrees of Separation and the Rule of 150
(Yaguang, 2009). It has also been found that online
users tend to join together to form small communities.
Meanwhile, growing user demand on SNSs has triggered
the take-off of personalized recommendation (Jie,
2011) and information retrieval (Christopher, 2010)
techniques in recent years. Junco Reynol (Reynol,
2011) studied the relationship between Facebook use
and student engagement and found that Facebook use was
negatively predictive of engagement scale score and
positively predictive of time spent on SNS. However,
these works were based on users' statistical
information, such as common friend count, commonly
shared resources, time spent on SNSs, or information
checking frequency, which reflects a user's SNS usage
rather than her inner preferences and personality.

Personality is one of the hottest topics in
psychology. According to the Big Five personality
traits theory, personality can be divided into five
dimensions: openness, conscientiousness, extraversion,
agreeableness, and neuroticism. The Berkeley
Personality Lab (Berkeley, 2012), which focuses on
personality, self-perception, and individual
differences in emotion regulation, designed the Big
Five Inventory, which is widely used around the world.
It contains 44 questions with high validity and
reliability and returns a quantized personality score
in five dimensions.

So far, research that combines personality and SNSs
has only a limited foundation (Shaoqi, 2011). Emily S.
Orr discussed the influence of shyness on the use of
SNSs in undergraduate samples in 2009. The study found
that shyness was significantly positively correlated
with the time spent on SNSs and negatively correlated
with the number of "friends" (Sisic, 2009). Meanwhile,
Teresa Correa analyzed the intersection of users'
personality and social media (Correa, 2010) and found
that openness and extraversion were positively related
to social media use, while neuroticism was a negative
predictor. However, these works could only give the
association between personality and behavior rather
than a quantification of personality metrics.

Samuel D. Gosling (Gosling, 2011) experimented on the
manifestations of personality in SNSs. This research
presented a mapping between personality and SNS
behavior. The authors examined personality with
self-reported Facebook usage and observable profile
information and reported the correlation between
personality and online behavior. They designed 11
features: friend count, weekly usage, and the usage
frequency of 9 other functions. However, their
features are all based on statistical characteristics
and do not capture any inner properties of the user.
The data collection relies on self-reported usage and
observable profile information, which requires a large
amount of manual work. Therefore, the objectivity of
the experiment is reduced.



Generally speaking, most research on personality has
used only psychological methods. Neither self-report
nor observable profile information is efficient for
large-scale data acquisition. At the same time, the
features used come only from SNS usage frequency
statistics. It would be better if some emotion-related
features (e.g. blog emotion, anger or happiness) were
added. The association between personality and SNS
behavior gives only a correlation factor rather than a
personality prediction. Although these factors
describe the relationship between personality and
behavior, they cannot accurately quantify personality
for an arbitrary test sample. Since psychology and
computer science each have their own advantages and
disadvantages, we combine the two subjects and build a
predictor system that can quantify a user's Big Five
personality based on SNS usage and preferences.

3 METHODS 

In our work, we build a personality computation and
prediction model based on users' online SNS usage. We
choose Renren, the most widely used Chinese SNS, as
our experiment platform. In this part, we address the
following problems:
How can we collect a large dataset objectively and
efficiently?
How can we design features that distinguish different
personality preferences?
How can we build the personality computation and
prediction model?



3.1 DATA COLLECTION 

Renren has opened many APIs for third-party
application design (Platform, 2012). These third-party
applications can be divided into three classes: Web
Access Connections, Web/Wap Applications, and Mobile
Client Applications. We have already developed an
online mental illness treatment website, Dao (Dao,
2012). We make it a Web Access Connection to Renren,
which allows Renren users to log in to Dao with their
Renren accounts. When our experiment participants log
in to Dao, they grant user authorization. We can then
call the Renren APIs to collect users' historical
behavior data and save it in our local database.

To obtain labeled data, each participant is asked to
complete an inventory. The Big Five Inventory (BFI),
designed by the Berkeley Personality Lab, is a
self-report inventory that measures the Big Five
personality dimensions. It is quite brief for a
multidimensional personality inventory (44 questions
in all) and consists of short phrases with relatively
accessible vocabulary (Berkeley, 2012). After a
participant finishes the inventory, a personality
result vector with five dimensions is saved as the
data label. With these techniques, a labeled dataset
can be built efficiently.

3.2 FEATURE DESIGN 

The initial data we collect cannot be used directly,
so we design 41 features based on the BFI and some
previous work in this field to describe user behavior.
The features can be divided into 5 groups, listed in
Table 1 below, where T stands for time and E stands
for emotion.

Table 1: Features Design 
FEATURE GROUP COUNT 



Basic Info. 5 

SNS Usage 28 

T-Related Usage 3 

E-Related Usage 2 

T&E-Related Usage 3 

Features of basic information and SNS usage have
already been used in much previous work, including the
research listed in section 2. These features include
the user's gender, age, hometown, blog usage
frequency, resource uploading frequency, and so on.
The time-related features are those correlated with
the user's recent psychological state, such as the
status or blog publishing count during the most recent
month. The emotion-related features are those related
to the user's emotion distribution (angry, funny,
surprised, and moving), such as the count of the top
emotion over all her blogs: we compute the emotion
distribution of the user and select the count of the
most frequent emotion. These features stand for the
user's emotion preferences and have a strong relation
with her personality. The final feature class, the
time&emotion-related features, takes both time and
emotion into account and describes the user's recent
emotional tendency, such as the emotion of the newest
status and its emotion length. Emotion length means
how long the most recent emotion has been sustained.

The emotion-related features need an emotion
predictor, which we developed in our previous work
last year. Based on the Naive Bayes method, the system
combines text classification with an emotion
dictionary. The key idea is to increase the weight of
emotion tokens and decrease the weight of non-emotion
tokens while training the model. We have tested it on
a large amount of text content and obtained accuracy
and recall above 80%. The predictor classifies an
article into one of the emotions (angry, funny,
surprised, and moving) according to its content.
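
The token-weighting idea can be illustrated with a minimal sketch. This is not the predictor used in our previous work: the weight values, the tiny emotion dictionary, the toy training documents, and the function names `train` and `classify` are all hypothetical, and the sketch assumes the weighting is applied as a multiplier on token counts during training.

```python
import math
from collections import Counter, defaultdict


def train(docs, emotion_words, emotion_weight=3.0, other_weight=1.0):
    """Train on (tokens, label) pairs. Both weights are hypothetical values."""
    token_mass = defaultdict(Counter)  # label -> weighted token counts
    doc_count = Counter()              # label -> number of training documents
    vocab = set()
    for tokens, label in docs:
        doc_count[label] += 1
        for tok in tokens:
            # Emotion-dictionary tokens count more than ordinary tokens.
            weight = emotion_weight if tok in emotion_words else other_weight
            token_mass[label][tok] += weight
            vocab.add(tok)
    return token_mass, doc_count, vocab


def classify(model, tokens):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    token_mass, doc_count, vocab = model
    total_docs = sum(doc_count.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in doc_count.items():
        mass = token_mass[label]
        total_mass = sum(mass.values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            score += math.log((mass[tok] + 1.0) / (total_mass + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for illustration only.
DOCS = [
    (["i", "am", "furious", "today"], "angry"),
    (["what", "a", "hilarious", "joke"], "funny"),
    (["this", "is", "shocking", "news"], "surprised"),
    (["such", "a", "touching", "story"], "moving"),
]
EMOTION_WORDS = {"furious", "hilarious", "shocking", "touching"}

model = train(DOCS, EMOTION_WORDS)
print(classify(model, ["furious", "story"]))
```

With the extra weight, a single dictionary token such as "furious" outweighs an ordinary token that happens to co-occur with another class, which is the intended effect of the weighting.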

3.3 SYSTEM DESIGN 

Our system combines machine learning from computer
science with the Big Five Inventory from psychology,
as shown in Figure 1.

Figure 1: System Flow Chart

We call the Application Program Interfaces (APIs) of
Renren to collect users' online behaviors, including
the basic user profile, the usage frequency of basic
functions, and blog/status text content. To obtain all
this information, experiment users need to authorize
us to use the APIs. We develop a Web Access Connection
to Renren.com that allows users to grant authorization
on Dao, an experiment platform we designed. Users then
complete the Big Five Inventory, and their behavior
data is labeled with the inventory results. Finally,
using data mining techniques, we train a prediction
model based on the feature vectors.

4 EXPERIMENT
4.1 SAMPLES

We have developed an experiment platform, Dao, which
participants can log in to with their Renren accounts.
Participants can do the experiment anywhere through
the network, even in their dormitories. To maintain
the quality of the training samples, a testing fee is
paid to each participant after finishing the
inventory. We advertised our experiment around the
Graduate University of Chinese Academy of Sciences
(GUCAS) and recruited 335 participants in January and
February 2012. Each of them was shown an informed
consent form stating that we would collect their
Renren usage data. All participants are friends or
friends of friends of the authors, are from China, and
have an average age of 23.833 years. To be included, a
participant had to finish the inventory carefully and
be an active Renren user with a friend count over 100,
a status count over 50, and a blog count over 10.
Applying these criteria, we selected 209 participants
as our training dataset, 72 females and 137 males.
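
The activity criteria above amount to a simple filter. The following is a minimal sketch: the `Participant` record and its field names are hypothetical, "over" is read as strictly greater than, and careful completion of the inventory is reduced to a boolean flag.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    # Hypothetical record; field names are not taken from the Renren API.
    friend_count: int
    status_count: int
    blog_count: int
    inventory_complete: bool


def is_eligible(p: Participant) -> bool:
    """Apply the sample selection criteria (strict thresholds assumed)."""
    return (
        p.inventory_complete
        and p.friend_count > 100
        and p.status_count > 50
        and p.blog_count > 10
    )


candidates = [
    Participant(250, 120, 30, True),   # active user, kept
    Participant(80, 200, 40, True),    # too few friends, dropped
    Participant(300, 60, 5, True),     # too few blogs, dropped
    Participant(300, 60, 20, False),   # inventory not finished, dropped
]
selected = [p for p in candidates if is_eligible(p)]
print(len(selected))
```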



4.2 PRE-PROCESSING 

After collecting the behavior data of all the eligible
participants, we label each sample with its scores on
the five personality dimensions. However, the scores
are continuous values ranging from one to five, which
cannot be used directly for classification. They are
shown in Figure 2, where the horizontal axis stands
for participant IDs, the vertical axis stands for the
participant's score on one dimension, A stands for
agreeableness, C for conscientiousness, E for
extraversion, N for neuroticism, and O for openness.

Figure 2: Personality Score Distribution

To train our system with classification methods from
machine learning, we discretize the initial scores and
use the discrete values as data labels. The
discretization functions we use are shown below:



$$\alpha(x) = E(x) - \sigma(x),$$
$$\beta(x) = E(x) + \sigma(x).$$



For each dimension, this means that we separate the
label scores into three classes: the low-score group
from 1 to α, the middle-score group from α to β, and
the high-score group from β to 5. Here E(x) is the
mean personality score for dimension x, σ(x) is the
standard deviation of dimension x, and x is chosen
from E, A, C, N, and O, the five personality
dimensions. This changes the label distribution from
Figure 2 to Table 2 shown below. The three numbers in
the sample count column are the sample counts of the
three groups. For example, for dimension E there are
62 samples in the low-score group 1 ~ 2.31 (α), 92
samples in the middle-score group 2.31 ~ 3.59 (β), and
55 samples in the high-score group 3.59 ~ 5.
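
As a minimal sketch of this discretization, assuming the population standard deviation and assuming that scores exactly on a boundary fall into the middle group (the text does not specify either), the thresholds and labels can be computed as follows. The function names are illustrative.

```python
from statistics import mean, pstdev


def thresholds(scores):
    """Return (alpha, beta) = (mean - std, mean + std) for one dimension.

    pstdev (population standard deviation) is an assumption; the sample
    standard deviation would shift the thresholds slightly.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return mu - sigma, mu + sigma


def discretize(score, alpha, beta):
    """Map a continuous score in [1, 5] to one of three classes."""
    if score < alpha:
        return "low"
    if score <= beta:  # boundary handling is an assumption
        return "middle"
    return "high"


# Thresholds reported for dimension E: mean 2.95, standard deviation 0.64.
alpha_e, beta_e = 2.95 - 0.64, 2.95 + 0.64
labels = [discretize(s, alpha_e, beta_e) for s in (2.0, 3.0, 4.0)]
print(labels)
```

Dropping every sample labeled "middle" gives the two-class variant of the problem.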



Table 2: Discretization
PERSONALITY E(x) σ(x) COUNT

E 2.95 0.64 62, 92, 55

A 3.71 0.47 69, 67, 73

C 3.29 0.55 72, 78, 59

N 3.02 0.61 54, 89, 66

O 3.39 0.61 56, 82, 71



4.3 MODEL TRAINING AND TESTING 

At this point, we have turned the whole task into a
classification problem. We test the dataset with many
classification algorithms, such as Naive Bayes (NB),
Support Vector Machine (SVM), Decision Tree, and so
on. We find that the C4.5 Decision Tree (Quinlan,
1993) gives the best results. Using 10-fold cross
validation, the results (precision, recall, and
F-value) of the three-class classification problem for
the five personality dimensions are shown in Table 3.

Table 3: Three-Class Classification
DIMENSION P R F-VALUE

A 0.725 0.722 0.723

N 0.713 0.708 0.710

C 0.702 0.703 0.701

E 0.718 0.718 0.717

O 0.697 0.694 0.695
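
For reference, the three reported measures can be computed from true and predicted labels as in the sketch below. How the per-class values were combined into one number per dimension is not stated above; the sketch assumes a support-weighted average over the classes, and the function name `weighted_prf` and the toy labels are illustrative.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F-value over all classes.

    The weighting scheme is an assumption, not taken from the paper.
    """
    support = Counter(y_true)
    n = len(y_true)
    p_avg = r_avg = f_avg = 0.0
    for label in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support[label]
        if precision + recall > 0:
            f_value = 2 * precision * recall / (precision + recall)
        else:
            f_value = 0.0
        weight = support[label] / n
        p_avg += weight * precision
        r_avg += weight * recall
        f_avg += weight * f_value
    return p_avg, r_avg, f_avg


# Toy labels, invented for illustration only.
y_true = ["low", "low", "mid", "mid", "high", "high"]
y_pred = ["low", "mid", "mid", "mid", "high", "low"]
p, r, f = weighted_prf(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f, 3))
```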



We also consider removing the middle-score group and
experimenting on only the low and high groups of each
dimension, that is, the intervals 1 to α and β to 5.
In this test, we delete the "middle personality"
samples and consider only the two extreme personality
cases. That changes the task into a two-class
classification problem. Still using the C4.5 Decision
Tree, the results are shown in Table 4. Since the
problem is simplified to a two-class classification
problem, the measures rise for N, C, E, and O; the
exception is A, whose F-value drops from 0.723 to
0.697.

Table 4: Two-Class Classification 
DIMENSION P R F-VALUE 



A 0.697 0.697 0.697 

N 0.749 0.750 0.749 

C 0.825 0.824 0.824 

E 0.839 0.838 0.838 

O 0.811 0.811 0.811 



We show only the top part of the decision tree for
dimension A in Figure 3, since the whole tree is too
large to display. The root node, zidou, is the amount
of virtual money in the user's account. The
second-level node selfcommentproportion is the
proportion of comments that come from the user
herself. Blogemoticon is the number of emoticons used
in the user's blogs. The third-level node
recentstatustopemotionratio is the share of the
majority emotion among the statuses of the most recent
month. For example, if a user has 10 statuses in the
most recent month, 6 of which make the reader angry
and 4 of which make the reader happy,
recentstatustopemotionratio is set to 0.6. BlogIYouIt
indicates which grammatical person the user prefers in
blogs: I, you, or it (including "he", "she", and
"they").
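
The recentstatustopemotionratio example can be reproduced with a few lines. This is a minimal sketch: the function name is illustrative, the emotion labels are assumed to come from the emotion predictor, and returning 0.0 for a user with no recent statuses is an assumption.

```python
from collections import Counter


def top_emotion_ratio(emotions):
    """Share of the most frequent emotion among a user's recent statuses."""
    if not emotions:
        return 0.0  # assumption: no recent statuses gives a ratio of 0
    _, top_count = Counter(emotions).most_common(1)[0]
    return top_count / len(emotions)


# The example from the text: 10 recent statuses, 6 angry and 4 happy.
recent = ["angry"] * 6 + ["happy"] * 4
print(top_emotion_ratio(recent))
```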




Figure 3: 3-Class Decision Tree For A 



5 DISCUSSION 

From the above results, it is easy to see that
different attributes have different weights and that
different dimensions have different high-weight
attributes. This reflects the differences among the
five personality dimensions and their behavioral
manifestations, and it is strong evidence of the
correlation between behavior and personality. We
discuss the five dimensions from the view of the
decision trees. Features near the root of a decision
tree contribute most strongly to classification.

5.1 RESULTS ANALYSIS 

The classification algorithm we use is the C4.5
decision tree, which uses gain ratio (Han and Kamber,
2008) to select features. The purpose of feature
selection is to find the splitting rule that best
predicts the results. Gain ratio, a normalized
information gain, measures the classification
contribution of a feature. C4.5 calculates the gain
ratio of each feature and sets the feature with the
highest gain ratio as the root node. Repeating this
operation yields a tree with high-gain-ratio features
near the top and low-gain-ratio features below. We
therefore list the root nodes (highest gain ratio) and
2nd root nodes (second-highest gain ratio) for the
five dimensions in Table 5, where p(x) means the
proportion of variable x.
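
To make the selection criterion concrete, the sketch below computes the gain ratio of a single categorical feature as information gain divided by split information. It is an illustration rather than the C4.5 implementation used in our experiments: continuous features, which C4.5 handles by searching for thresholds, are assumed to be binned beforehand, and the feature values and labels are invented.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of the split, normalized by its split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    # A feature with a single value cannot split the data.
    return info_gain / split_info if split_info > 0 else 0.0


# Toy example: a binned friend count against an extraversion class.
friend_bins = ["many", "many", "few", "few"]
classes = ["high", "high", "low", "low"]
print(gain_ratio(friend_bins, classes))
```

The feature with the largest value of `gain_ratio` would be placed at the root, and the same computation is repeated on each subset of the data.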





Table 5: Strong Contribution Features
D. LEVEL FEATURES

A Root zidou

A 2nd Root p(selfcomment), blogemoticon

C Root age

C 2nd Root p(friendcomment), guestbook

E Root friend

E 2nd Root blogemoticon, zzstatus

N Root friend

N 2nd Root usage, p(angryblog)

O Root friend

O 2nd Root usage, recentstatus



From the table, we can draw several interesting and
meaningful conclusions. Agreeableness refers to being
helpful, cooperative, and sympathetic towards others.
People with high scores in agreeableness tend to send
more blogs or emails (YANG, 2007). On SNSs, this is
reflected in the interaction between users. Users who
have more virtual money (zidou) are more likely to buy
virtual gifts for others. A person with a high score
in agreeableness tends to be more active in chatting
online, even if others ignore her messages. Therefore,
her self-comment proportion is relatively high. Also,
in order to get the attention of others, such users
may be more likely to use emoticons (blogemoticon) in
their blogs.

Conscientiousness is characterized by being
disciplined, organized, and achievement-oriented.
People who use the guestbook are most likely asking
others for help, for example asking for a location or
an email address. People with a high score in this
dimension tend to be helpful to others and use the
guestbook more frequently.

Extraversion is displayed through a higher degree of
sociability, assertiveness, and talkativeness. It is
easy to see that people with more friends (friend) are
more likely to be extraverted. An extraverted person
may tend to use emoticons in her blogs to show her
character. She has many friends and is happy to talk
with others, and even to republish (zzstatus) others'
statuses.

Neuroticism refers to the degree of emotional
stability, impulse control, and anxiety. People with a
high score in this dimension tend to be easily
angered. Therefore, a high proportion of their blogs
may make readers angry, and neuroticism is positively
related to the proportion of angry blogs
(angryblogproportion).

Openness is reflected in a strong intellectual
curiosity and a preference for novelty and variety.
People who are curious about others tend to make many
new friends and to use SNSs heavily (usage). Their
statuses tend to be updated frequently, which means
their recent status count (recentstatus) is relatively
high. As in Correa's work (Correa, 2010), openness is
positively correlated with SNS usage, which is
consistent with our results.

5.2 CONCLUSION AND FUTURE RESEARCH

Automatic personality prediction can open a new window
not only for computer science but also for psychology.
SNS providers can recommend resources based on a
user's personality in the future. For example,
outgoing users may prefer international news and like
to make friends with others, which can guide
networking service providers. In addition,
psychological experiments gain an objective data
selection strategy, which can increase their quality
and confidence levels. This system can also be used in
online mental illness treatment. For example,
extroverted patients may readily and honestly explain
their illness to the doctor, while introverted
patients may say little about their illness, which
calls for the doctor's patience.

We will continue our work on this cross-disciplinary
topic. To improve the whole system, we may consider
the correlations among the five personality
dimensions, since they are not completely orthogonal.
We may use multi-task learning techniques to improve
our training algorithm. The consistency between online
and offline behavior is another interest of ours.
Although the trend in networking is to build a virtual
world that closely resembles the real world, there are
still some differences between online and offline user
behavior. Most online services are based on a
real-name system, but interaction is still not
face-to-face. Users do not need to worry about losing
face, which matters a great deal in the real world. We
believe that the difference between online and offline
behavior, although real, is small.